Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files. #417

Merged
merged 57 commits into from
Jun 8, 2024

Conversation

haiqi96
Copy link
Contributor

@haiqi96 haiqi96 commented May 28, 2024

References

Description

The change is motivated by the need to support log viewer, which

  1. Requires the input to be in the IR format.
  2. Requires the IR to be reasonably small to not overflow the memory.

This PR introduces a new decompression interface decompress_ir that decompress a file split into one or multiple IRs.

The function takes in an original file ID, a specific message index and a threshold.
It first find the file split which contains the message index, and decompress the split into one or more IR; the function creates a new IR whenever the current IR's raw size (i.e. not zstd compressed) is greater than the given threshold.
Each IR follows the IRv1 format, meaning it has the complete preamble and and EoF byte, and can be deserialized individually.

The generated IR use the naming format: <FILE_ORIG_ID><begin_message_ix><end_message_ix>.clp.zst. Since the preamble of the IRv1 doesn't contain any log event index information, this name is essential for the user of the IR to know what's the range of log index the IR contains.

The PR also introduces a new class LogEventSerializer.cpp that serialized a plain text message into the IR format.

Due to the limitation of our current IR related encoding APIs, the function is designed with two inefficienies

  1. The serialized data is first written into an internal buffer, instead of directly writing to the on-disk file.
  2. The serializer takes in a plain text message, meaning that the encoded message in the archive needs to be first decompressed into plain text (which introduce extra overhead) before being serialized.

We agreed that these two are acceptables as properly supporting the flow will take more thoughts on reworking the encoder interface.

Validation performed

The validation is not directly performed on this PR, but on a following PR which adds the decompress_ir in to the execution path of clp executable.

To validate the functionality, we compressed a 64MB file into archive(s). We then decompressed it into mulitple IRs, decoded and concatnate them, and did a binary comparison with the original file.

We used two configuration to cover all the possible cases:

  1. Compressed a 64MB hadoop log using smaller encoded file size and archive size, such that it splits the original file into 3 splits across 2 archives. We then decompressed all 3 IRs by running clp 3 times, using different message index

  2. Compressed the 64MB hadoop log using default settings, so only one file and archive was generated. We then decompressed the IR using a 32MB threshold, generating 3 IRs on disk.

@haiqi96 haiqi96 changed the title Archive to ir Draft: Archive to IR May 28, 2024
@haiqi96 haiqi96 force-pushed the ArchiveToIR branch 3 times, most recently from 7d848a4 to 425377b Compare May 29, 2024 14:34
components/core/src/clp/ir/LogEventSerializer.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/LogEventSerializer.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/LogEventSerializer.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/LogEventSerializer.cpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/Constant.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/LogEventSerializer.cpp Outdated Show resolved Hide resolved
components/core/src/clp/clp/FileDecompressor.cpp Outdated Show resolved Hide resolved
components/core/src/clp/clp/FileDecompressor.cpp Outdated Show resolved Hide resolved
@haiqi96 haiqi96 changed the title clp-core: Add support for decompressing an IR from a specific file split in the clp-archive core-clp: Add support for decompressing an IR from a specific file split in the clp-archive Jun 5, 2024
LinZhihao-723
LinZhihao-723 previously approved these changes Jun 5, 2024
Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message suggestion:
core-clp: Add support for decompressing an IR from a specific file split from a clp archive.
I've done my parts of review, maybe you can take it over from here @kirkrodrigues

components/core/src/clp/ir/Constant.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/Constant.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/LogEventSerializer.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/LogEventSerializer.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ir/LogEventSerializer.hpp Outdated Show resolved Hide resolved
components/core/src/clp/clp/FileDecompressor.cpp Outdated Show resolved Hide resolved
components/core/src/clp/clp/FileDecompressor.cpp Outdated Show resolved Hide resolved
components/core/src/clp/clp/FileDecompressor.cpp Outdated Show resolved Hide resolved
@kirkrodrigues
Copy link
Member

Forgot to mention, since we now have a log event serializer, can we add some unit tests to test serialization + deserialization?

{
SPDLOG_ERROR(
"Failed to create directory structure {}, errno={}",
output_dir.c_str(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output_dir.c_str(),
temp_output_dir.c_str(),

Comment on lines 59 to 69
if (false == res) {
close_writer();
return true;
}

m_is_open = true;

// Flush the preamble
flush();

return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (false == res) {
close_writer();
return true;
}
m_is_open = true;
// Flush the preamble
flush();
return false;
if (false == res) {
close_writer();
return false;
}
m_is_open = true;
// Flush the preamble
flush();
return true;

we should return false to indicate error right?

}
begin_message_ix = end_message_ix;

if (auto const error_code = ir_serializer.open(temp_ir_path.string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not error_code


LogEventSerializer<four_byte_encoded_variable_t> ir_serializer;
// Open output IR file
if (auto const error_code = ir_serializer.open(temp_ir_path.string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not error_code

LinZhihao-723
LinZhihao-723 previously approved these changes Jun 7, 2024
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files.

@haiqi96 haiqi96 changed the title core-clp: Add support for decompressing an IR from a specific file split in the clp-archive core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files. Jun 8, 2024
@kirkrodrigues kirkrodrigues merged commit 040eb51 into y-scope:main Jun 8, 2024
11 checks passed
@haiqi96 haiqi96 deleted the ArchiveToIR branch June 28, 2024 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants